We consider the problem of simultaneous variable selection and estimation in additive, partially
linear models for longitudinal/clustered data. We propose an estimation procedure via poly- 
nomial splines to estimate the nonparametric components and apply proper penalty functions 
to achieve sparsity in the linear part. Under reasonable conditions, we obtain the asymptotic 
normality of the estimators for the linear components and the consistency of the estimators for 
the nonparametric components. We further demonstrate that, with proper choice of the regular- 
ization parameter, the penalized estimators of the non-zero coefficients achieve the asymptotic 
oracle property. The finite sample behavior of the penalized estimators is evaluated with simu- 
lation studies and illustrated by a longitudinal CD4 cell count data set. 

Keywords: additive partially linear model; clustered data; longitudinal data; model selection; 
penalized least squares; spline 

1. Introduction 

In the past two decades, there has been a considerable amount of research on additive,
partially linear models (APLM); see Opsomer and Ruppert [27], Härdle, Liang and
Gao [12], Li [15], Fan and Li [9], Liang et al. [18], Liu, Wang and Liang [21], Ma and
Yang [24], among others. APLMs address three fundamental aspects of statistical
models (Stone [29]): flexibility, dimensionality and interpretability. In this paper, we
consider APLMs for clustered and longitudinal data.

Let $(Y_{ij}, X_{ij}, Z_{ij})$, $1 \le i \le n$, $1 \le j \le m_i$, be the $j$th observation for the $i$th subject or
cluster, where $Y_{ij}$ is the response variable, $X_{ij} = (1, X_{ij1}, \ldots, X_{ij(d_1-1)})^\top$ is a $d_1$-vector
of covariates, and $Z_{ij} = (Z_{ij1}, \ldots, Z_{ijd_2})^\top$ is a $d_2$-vector of covariates. An APLM for this
kind of data is given by

$$Y_{ij} = \mu_{ij} + \varepsilon_{ij} = X_{ij}^\top \beta + \sum_{l=1}^{d_2} \eta_l(Z_{ijl}) + \varepsilon_{ij}, \qquad j = 1, \ldots, m_i,\ i = 1, \ldots, n, \tag{1}$$

where $\beta$ is a $d_1$-dimensional regression parameter, and $\eta_l$, $l = 1, \ldots, d_2$, are unknown but
smooth functions. We assume $\varepsilon_i = (\varepsilon_{i1}, \ldots, \varepsilon_{im_i})^\top \sim N(0, \Sigma_i)$. For identifiability, both
the parametric and nonparametric components must be centered, that is, $E\eta_l(Z_{ijl}) = 0$,
$l = 1, \ldots, d_2$, and $EX_{ijk} = 0$, $k = 1, \ldots, d_1$. When $d_2 = 1$, model (1) simplifies to the
partially linear model (PLM) in Lin and Carroll [20]. Model (1) retains the merits of
additive models, while it is more flexible than purely additive models by allowing a subset
of the covariates to be discrete and/or unbounded. When the $m_i$'s and $\Sigma_i$'s are the same for
all individuals, Carroll et al. [3] considered the efficient estimation of $\beta$ in model (1)
using local linear smooth backfitting. In this paper we consider a more general scenario
in which both $m_i$ and $\Sigma_i$ may vary across subjects or experimental units to allow irregular
measurements for individuals. Our goal is to simultaneously select significant variables
and efficiently estimate the unknown components for model (1). This is challenging due
to the "curse of dimensionality" and the additional complexity of the correlation
structures (Wang [34]) introduced by repeated measurements.

To alleviate the effect of the "curse of dimensionality," more parsimonious models become
desirable in practice; see Fan [10], Hall, Müller and Wang [11] and Wang et al. [32].
Variable selection is fundamental to high-dimensional statistical modeling. In the absence
of prior knowledge, a large number of variables may be included at the initial
stage of modeling in order to reduce possible model bias. This may lead to a complicated
model including many insignificant variables, resulting in less predictive power
and difficulty in interpretation. There is an extensive literature on variable selection via
various approaches, for example, the classical information criteria such as the Akaike
information criterion (AIC) and Bayesian information criterion (BIC) in Yang [40], the
least absolute shrinkage and selection operator (LASSO) proposed in Tibshirani [30, 31],
the non-negative garrote in Yuan and Liu [41], the difference convex algorithm in Wu
and Liu [36], the combination of $L_0$ and $L_1$ penalties in Liu and Wu [22], and the
nonparametric independence screening procedure in Fan, Feng and Song [6].

Many traditional variable selection procedures in use, including stepwise selection, AIC
or BIC, can be computationally expensive and ignore the stochastic errors inherited in the
variable selection process. Penalized least squares approaches have gained popularity in
recent years to automatically and simultaneously select significant variables; for example,
Antoniadis [1] proposed the hard thresholding penalty which enables best subset
selection and stepwise deletion in certain cases. The LASSO (Tibshirani [30, 31]) is one
of the most popular shrinkage estimators, but it has some deficiencies (Meinshausen and
Bühlmann [26]). Fan and Li [7] proposed the smoothly clipped absolute deviation penalty
(SCAD), which achieves an "oracle" property in the sense that it performs as well as if
the subset of significant variables were known in advance. The SCAD-penalized selection
procedures were illustrated in Fan and Li [7] for parametric models; Cai et al. [2] and
Fan and Li [8] for survival models; Li and Liang [16] for generalized varying-coefficient
models; Liang and Li [17] and Ma and Li [25] for measurement error models; Xue [37] for
pure additive models; and Xue, Qu and Zhou [38] for generalized additive models with
correlated data.

We propose a model selection method for APLMs with repeated measures by penalizing
appropriate estimating functions. We approximate nonparametric components by spline
functions and obtain asymptotic normality for the coefficient estimators via one-step least
squares. The proposed approach is computationally expedient and easy to implement, in
contrast to the backfitting approach in Carroll et al. [3]. Moreover, it avoids the pitfall
of the backfitting algorithms caused by dependence between covariates. Furthermore,
we show that the estimator can correctly select the non-zero coefficients with probability
converging to 1 and the $\sqrt{n}$-consistent estimators of the non-zero coefficients can perform
as well as an oracle estimator in the sense of Fan and Li [7] with a suitable choice of
penalty function.

The paper is organized as follows. In Section 2, we introduce the penalized polynomial 
spline estimating method. Section 3 provides the asymptotic properties of the proposed 
estimators, including the consistency and oracle property of the parametric components, 
as well as the rate of the L2-convergence of the nonparametric components. In Section 4, 
we discuss some implementation issues of the proposed procedure. Simulation studies are 
presented in Section 5. Section 6 illustrates the application using longitudinal CD4 cell- 
count data. We conclude with a discussion in Section 7. Technical proofs are presented 
in the Appendix. 

2. Penalized spline estimation 

For simplicity, denote vectors $Y_i = (Y_{i1}, \ldots, Y_{im_i})^\top$ and $\mu_i = (\mu_{i1}, \ldots, \mu_{im_i})^\top$, $1 \le m_i \le M$,
$1 \le i \le n$. Similarly, let $X_i = \{(X_{i1}, \ldots, X_{im_i})^\top\}_{m_i \times d_1}$ and $Z_i = \{(Z_{i1}, \ldots, Z_{im_i})^\top\}_{m_i \times d_2}$.
Assume that $Z_{ijl}$ has the same distribution as $Z_l$, which is distributed on a compact
interval $[a_l, b_l]$, $1 \le l \le d_2$, and, without loss of generality, we take all intervals
$[a_l, b_l] = [0, 1]$, $1 \le l \le d_2$. Let $\eta_l(Z_{il}) = \{\eta_l(Z_{i1l}), \ldots, \eta_l(Z_{im_il})\}^\top$, for $l = 1, \ldots, d_2$. The
mean function in model (1) can be written in matrix notation as $\mu_i = X_i\beta + \sum_{l=1}^{d_2} \eta_l(Z_{il})$,
which is a semiparametric extension of the marginal model in Liang and Zeger [19] with
an identity link.

As in Wang, Carroll and Lin [35], we allow $X$ and $Z$ to be dependent. Let $V_i =
V_i(X_i, Z_i)$ be the assumed "working" covariance of $Y_i$, where $V_i = A_i^{1/2} R_i A_i^{1/2}$, $A_i$
denotes an $m_i \times m_i$ diagonal matrix that contains the marginal variances of $Y_{ij}$, and $R_i$
is an invertible working correlation matrix. Throughout, we assume that $V_i$ depends on
a nuisance finite dimensional parameter vector $\alpha$.

Following Wang and Yang [33], we approximate the nonparametric functions $\eta_l$'s by
polynomial splines. Let $G_n$ be the space of polynomial splines of degree $q \ge 1$. We
introduce a sequence of spline knots

$$t_{-q} = \cdots = t_{-1} = t_0 = 0 < t_1 < \cdots < t_N < 1 = t_{N+1} = \cdots = t_{N+q+1},$$

where $N \equiv N_n$ is the number of interior knots, which increases with the sample size $n$
at the precise order given in Assumption (A5). Then $G_n$ consists of functions
$\varpi$ satisfying (i) $\varpi$ is a polynomial of degree $q$ on each of the subintervals $I_s = [t_s, t_{s+1})$,
$s = 0, \ldots, N_n - 1$, $I_{N_n} = [t_{N_n}, 1]$; (ii) for $q \ge 1$, $\varpi$ is $(q - 1)$ times continuously
differentiable on $[0, 1]$. In the following, let $J_n = N_n + q + 1$, and we adopt the normalized B-spline
space $G_n^0 = \{B_{s,l}: 1 \le l \le d_2, 1 \le s \le J_n\}^\top$ in Xue and Yang [39]. Equally spaced knots
are used in this article for simplicity of proof. However, other regular knot sequences can
also be used with similar asymptotic results.
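The knot sequence and basis described above can be sketched numerically. The following is a minimal illustration, assuming equally spaced interior knots on $[0, 1]$ and the standard Cox-de Boor recursion; the function name `bspline_basis` is hypothetical, and the additional normalization and empirical centering used with the space in Xue and Yang [39] are not applied here.

```python
def bspline_basis(z, n_interior, degree):
    """Evaluate the J = n_interior + degree + 1 B-spline basis functions at z in [0, 1].

    Knots follow the layout in the text: degree + 1 repeated knots at 0,
    n_interior equally spaced interior knots, and degree + 1 repeated knots at 1.
    """
    q = degree
    interior = [(s + 1) / (n_interior + 1) for s in range(n_interior)]
    knots = [0.0] * (q + 1) + interior + [1.0] * (q + 1)

    # Degree-0 functions: indicators of the half-open knot intervals.
    # The last non-empty interval is closed on the right so that z = 1 is covered.
    basis = []
    for s in range(len(knots) - 1):
        left, right = knots[s], knots[s + 1]
        inside = left <= z < right or (z == 1.0 and left < right == 1.0)
        basis.append(1.0 if inside else 0.0)

    # Cox-de Boor recursion, raising the degree one step at a time.
    for d in range(1, q + 1):
        raised = []
        for s in range(len(knots) - d - 1):
            value = 0.0
            den_left = knots[s + d] - knots[s]
            if den_left > 0:
                value += (z - knots[s]) / den_left * basis[s]
            den_right = knots[s + d + 1] - knots[s + 1]
            if den_right > 0:
                value += (knots[s + d + 1] - z) / den_right * basis[s + 1]
            raised.append(value)
        basis = raised
    return basis


# Cubic splines (q = 3) with N = 4 interior knots give J_n = 4 + 3 + 1 = 8 functions.
values = bspline_basis(0.3, n_interior=4, degree=3)
```

The values at any point of $[0, 1]$ are non-negative and sum to one, which is a quick sanity check on the knot layout.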

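Suppose that $\eta_l$ can be approximated well by a spline function in $G_n^0$, so that

$$\eta_l(z_l) \approx \tilde\eta_l(z_l) = \sum_{s=1}^{J_n} \gamma_{sl} B_{s,l}(z_l). \tag{2}$$

Let $\gamma = (\gamma_{sl}: 1 \le s \le J_n, 1 \le l \le d_2)^\top$ be the collection of the coefficients in (2), and let

$$B_{ijl} = [\{B_{s,l}(Z_{ijl}): 1 \le s \le J_n\}^\top]_{J_n \times 1}, \qquad B_{ij} = \{(B_{ij1}^\top, \ldots, B_{ijd_2}^\top)^\top\}_{d_2J_n \times 1}; \tag{3}$$

then we have an approximation $\mu_{ij} \approx X_{ij}^\top\beta + B_{ij}^\top\gamma$. We can also write the approximation
in matrix notation as $\mu_i \approx X_i\beta + B_i\gamma$, where $B_i = \{(B_{i1}, \ldots, B_{im_i})^\top\}_{m_i \times d_2J_n}$.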

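Let $\hat\beta = (\hat\beta_1, \ldots, \hat\beta_{d_1})^\top$ and $\hat\gamma = \{\hat\gamma_{sl}: s = 1, \ldots, J_n, l = 1, \ldots, d_2\}^\top$ be the minimizer of

$$Q_n(\beta, \gamma) = \frac{1}{2}\sum_{i=1}^n \{Y_i - (X_i\beta + B_i\gamma)\}^\top V_i^{-1}\{Y_i - (X_i\beta + B_i\gamma)\}, \tag{4}$$

which corresponds to the class of working covariance matrices $\{V_i, 1 \le i \le n\}$, or,
equivalently, they solve the estimating equations

$$\sum_{i=1}^n X_i^\top V_i^{-1}\{Y_i - (X_i\beta + B_i\gamma)\} = 0, \tag{5}$$

$$\sum_{i=1}^n B_i^\top V_i^{-1}\{Y_i - (X_i\beta + B_i\gamma)\} = 0. \tag{6}$$

Solving (6) yields

$$\tilde\gamma(\beta) = \left(\sum_{i=1}^n B_i^\top V_i^{-1}B_i\right)^{-1}\sum_{i=1}^n B_i^\top V_i^{-1}(Y_i - X_i\beta). \tag{7}$$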
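Replacing $\gamma$ by $\tilde\gamma(\beta)$ in (4), we define

$$Q(\beta) = Q_n\{\beta, \tilde\gamma(\beta)\} = \frac{1}{2}\sum_{i=1}^n [Y_i - \{X_i\beta + B_i\tilde\gamma(\beta)\}]^\top V_i^{-1}[Y_i - \{X_i\beta + B_i\tilde\gamma(\beta)\}]. \tag{8}$$

To select the significant parametric components, we add a penalty to $Q(\beta)$. Let $n_T =
\sum_{i=1}^n m_i$, and define the penalized version of $Q(\beta)$ as

$$Q_{\mathcal P}(\beta) = Q(\beta) + n_T\mathcal P(\beta), \tag{9}$$

where $\mathcal P(\beta) = \sum_{k=1}^{d_1} p_{\lambda_k}(|\beta_k|)$ for a pre-specified penalty function $p_{\lambda_k}(\cdot)$ with a
regularization parameter $\lambda_k$. Minimizing $Q_{\mathcal P}(\beta)$ in (9) yields a penalized estimator

$$\hat\beta^{\mathcal P} = \arg\min_\beta Q_{\mathcal P}(\beta). \tag{10}$$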

Various penalty functions can be used for $\mathcal P(\beta)$ in variable selection procedures. We
consider two penalty functions, the hard thresholding penalty (Antoniadis [1]),
$p_\lambda(\beta) = \lambda^2 - (|\beta| - \lambda)^2 I(|\beta| < \lambda)$, and the SCAD penalty (Fan and Li [7]), given by

$$p'_\lambda(\beta) = \lambda\left\{I(\beta \le \lambda) + \frac{(a\lambda - \beta)_+}{(a - 1)\lambda} I(\beta > \lambda)\right\} \quad \text{for some } a > 2 \text{ and } \beta > 0,$$

where $p_\lambda(0) = 0$, and $\lambda$ and $a$ are two tuning parameters. Arguing from a Bayesian
statistical point of view, Fan and Li [7] suggested using $a = 3.7$, which will be used in
our simulation studies.
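The two penalties can be written down directly from the formulas above. This is a minimal sketch; the function names `scad_derivative` and `hard_penalty` are illustrative choices, not notation from the paper.

```python
def scad_derivative(beta, lam, a=3.7):
    """First derivative p'_lambda(beta) of the SCAD penalty, for beta >= 0.

    Equals lam on [0, lam], decays linearly on (lam, a * lam], and is 0 beyond a * lam.
    """
    if beta <= lam:
        return lam
    return max(a * lam - beta, 0.0) / (a - 1.0)


def hard_penalty(beta, lam):
    """Hard thresholding penalty lam^2 - (|beta| - lam)^2 * I(|beta| < lam)."""
    b = abs(beta)
    inside = 1.0 if b < lam else 0.0
    return lam ** 2 - (b - lam) ** 2 * inside
```

For example, with `lam = 1.0` the SCAD derivative is `1.0` at `beta = 0.5` and `0.0` at `beta = 4.0`, so large coefficients are not shrunk, while the hard thresholding penalty is constant at `lam ** 2` once `|beta| >= lam`.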

The minimization problem in (10) is essentially a one-step least squares problem,
which can be easily solved and implemented with many existing regression programs. The
theorems established in Section 3.3 demonstrate that $\hat\beta^{\mathcal P}$ performs asymptotically as well
as an oracle estimator in terms of selecting the correct model when the regularization
parameter is appropriately chosen.



3. Asymptotic properties of the estimators 

For positive numbers $a_n$ and $b_n$, $n \ge 1$, let $a_n \asymp b_n$ denote that $\lim_{n\to\infty} a_n/b_n = c$, where
$c$ is some non-zero constant. Let $\|\phi\|_{L_2} = \{\int_0^1 \phi^2(z)\,dz\}^{1/2}$ denote the $L_2$ norm of any
square integrable function $\phi(z)$ on $[0, 1]$. Denote the space of the $p$th order smooth
functions as $C^{(p)}[0, 1] = \{\phi \mid \phi^{(p)} \in C[0, 1]\}$.



3.1. Assumptions 

The assumptions for the asymptotic results are listed below: 

(A1) The random variables $Z_{ijl}$ are bounded, uniformly in $1 \le j \le m_i$, $1 \le i \le n$,
$1 \le l \le d_2$. The marginal density $f_l(z_l)$ of $Z_l$ has the uniform upper bound $C_f$
and lower bound $c_f$ on $[0, 1]$. The joint density $f_{ll'}(z_l, z_{l'})$ of $(Z_{ijl}, Z_{ijl'})$ satisfies
$c_f \le f_{ll'}(z_l, z_{l'}) \le C_f$, for all $(z_l, z_{l'}) \in [0, 1]^2$, $1 \le l \ne l' \le d_2$.

(A2) The random variables $X_{ijk}$ are bounded, uniformly in $1 \le j \le m_i$, $1 \le i \le n$,
$1 \le k \le d_1$. The eigenvalues of $E\{X_{ij}X_{ij}^\top \mid Z_{ij}\}$ are bounded away from 0 and
infinity, uniformly in $1 \le j \le m_i$, $1 \le i \le n$.

(A3) The eigenvalues of the true covariance matrices $\Sigma_i$ are bounded away from 0 and
infinity, uniformly in $1 \le i \le n$.

(A4) The eigenvalues of the working covariance matrices $V_i$ are bounded away from 0
and infinity, uniformly in $1 \le i \le n$.

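To make $\beta$ estimable at the $\sqrt{n}$ rate, we need a condition to ensure that $X$ and $Z$
are not functionally related. Define $\mathcal H = \{\psi(z) = \sum_{l=1}^{d_2}\psi_l(z_l),\ E\psi_l(Z_l) = 0,\ \|\psi_l\|_{L_2} < \infty\}$, the
Hilbert space of theoretically centered $L_2$ additive functions on $[0, 1]^{d_2}$. Let $\psi_k^*$ be the
function $\psi \in \mathcal H$ that minimizes

$$\sum_{i=1}^n E[\{X_i^{(k)} - \psi(Z_i)\}^\top V_i^{-1}\{X_i^{(k)} - \psi(Z_i)\}],$$

where

$$X_i^{(k)} = (X_{i1k}, \ldots, X_{im_ik})^\top, \qquad 1 \le k \le d_1. \tag{11}$$

We then assume:

(A5) For $1 \le l \le d_2$, $1 \le k \le d_1$, assume that $\eta_l(z) \in C^{(p)}[0, 1]$, $\psi_k^* \in C^{(p)}[0, 1]$ for a
given integer $p \ge 1$, and the spline degree satisfies $q + 1 \ge p$. The number of
spline basis functions satisfies $J_n \asymp n^{1/(2p)}\log(n)$.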

Assumptions (A1)-(A4) are identical to (C1)-(C4) in Huang, Zhang and Zhou [14],
while Assumption (A5) is similar to (C1) and (C4) in Liu, Wang and Liang [21].

3.2. Asymptotic properties for the unpenalized estimators 

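According to the equations in (5) and (6), we have

$$\begin{pmatrix}\hat\beta \\ \hat\gamma\end{pmatrix} = \left(\sum_{i=1}^n D_i^\top V_i^{-1}D_i\right)^{-1}\sum_{i=1}^n D_i^\top V_i^{-1}Y_i, \tag{12}$$

where $D_i = (X_i, B_i)_{m_i \times (d_1 + d_2J_n)}$. The centered additive component $\eta_l(z_l)$ is estimated
by the empirically centered estimator

$$\hat\eta_l(z_l) = \sum_{s=1}^{J_n}\hat\gamma_{sl}\left\{B_{s,l}(z_l) - \frac{1}{n_T}\sum_{i=1}^n\sum_{j=1}^{m_i}B_{s,l}(Z_{ijl})\right\}. \tag{13}$$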

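Next we derive the asymptotic properties of $\hat\beta$ and $\hat\eta_l$. Let $X$ and $Z$ be the collections
of all $X_{ijk}$'s and $Z_{ijl}$'s, respectively, that is, $X_{n_T \times d_1} = (X_1^\top, \ldots, X_n^\top)^\top$ and
$Z_{n_T \times d_2} = (Z_1^\top, \ldots, Z_n^\top)^\top$. Define

$$\tilde X_i^{(k)} = X_i^{(k)} - \psi_k^*(Z_i), \quad 1 \le k \le d_1, \qquad \tilde X_i = (\tilde X_i^{(1)}, \ldots, \tilde X_i^{(d_1)})_{m_i \times d_1}, \tag{14}$$

for $1 \le i \le n$. Denote $\tilde X = \{(\tilde X_1^\top, \ldots, \tilde X_n^\top)^\top\}_{n_T \times d_1}$,

$$V^{-1} = \operatorname{diag}(V_1^{-1}, \ldots, V_n^{-1})_{n_T \times n_T}, \qquad \Sigma = \operatorname{diag}(\Sigma_1, \ldots, \Sigma_n)_{n_T \times n_T}.$$

Further define

$$\Omega(V, \Sigma) = \{A(V)\}^{-1}B(V, \Sigma)\{A(V)\}^{-1} \tag{15}$$

with $A(V) = E(n^{-1}\tilde X^\top V^{-1}\tilde X)$ and $B(V, \Sigma) = E(n^{-1}\tilde X^\top V^{-1}\Sigma V^{-1}\tilde X)$.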

The following result gives the asymptotic distribution of $\hat\beta$ for general working
covariance matrices.

Theorem 1. Under Assumptions (A1)-(A5), as $n \to \infty$,

$$n^{1/2}(\hat\beta - \beta) \rightarrow N(0, \Omega(V, \Sigma)).$$

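Remark 1. It is easy to show that the covariance $\Omega(V, \Sigma)$ in (15) is minimized by $V = \Sigma$
and in this case equals $\{A(V)\}^{-1}$. To construct confidence sets for $\beta$, $\Omega(V, \Sigma)$ is
consistently estimated by

$$\hat\Omega(V, \hat\Sigma) = n(\hat X^\top V^{-1}\hat X)^{-1}(\hat X^\top V^{-1}\hat\Sigma V^{-1}\hat X)(\hat X^\top V^{-1}\hat X)^{-1},$$

where $\hat X = \{(\hat X_1^\top, \ldots, \hat X_n^\top)^\top\}_{n_T \times d_1}$, and

$$\hat X_i = X_i - \operatorname{Proj}_{G_n}X_i, \qquad i = 1, \ldots, n, \tag{16}$$

in which $\operatorname{Proj}_{G_n}$ is the projection onto the empirically centered spline space.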

Remark 2. The result of Proposition 2 with identity link in Wang, Carroll and Lin [35]
is a special case of Theorem 1 with $V_1 = \cdots = V_n = V$, $m_1 = \cdots = m_n = M$ and $d_2 = 1$.

The next theorem shows that the estimated function $\hat\eta_l$ in (13) is $L_2$-consistent.

Theorem 2. Under Assumptions (A1)-(A5), $\|\hat\eta_l - \eta_l\|_{L_2} = O_P\{J_n^{-p} + (J_n/n)^{1/2}\}$, for
$1 \le l \le d_2$.

3.3. Sampling properties for the penalized estimators 

We next show that with a proper choice of $\lambda_k$, the penalized estimator $\hat\beta^{\mathcal P}$ has an
oracle property. To avoid confusion, let $\beta_0$ be the true value of $\beta$. Let $r$ be the number
of non-zero components of $\beta_0$. Let $\beta_0 = (\beta_{10}, \ldots, \beta_{d_10})^\top = (\boldsymbol\beta_{10}^\top, \boldsymbol\beta_{20}^\top)^\top$, where the sub-vector
$\boldsymbol\beta_{10}$ is assumed to consist of all $r$ non-zero components of $\beta_0$, and $\boldsymbol\beta_{20} = 0$ without loss of
generality. In a similar fashion to $\beta$, we can write the collections of all parametric
components as $X = (X_1^\top, X_2^\top)^\top$ and $\tilde X = (\tilde X_1^\top, \tilde X_2^\top)^\top$. Denote
$a_n = \max_{1 \le k \le d_1}\{|p'_{\lambda_k}(|\beta_{k0}|)|, \beta_{k0} \ne 0\}$ and
$w_n = \max_{1 \le k \le d_1}\{|p''_{\lambda_k}(|\beta_{k0}|)|, \beta_{k0} \ne 0\}$.

Theorem 3. Under Assumptions (A1)-(A5), and if $a_n \to 0$ and $w_n \to 0$ as $n \to \infty$, then
there exists a local solution $\hat\beta^{\mathcal P}$ in (10) such that its rate of convergence is $O_P(n^{-1/2} + a_n)$.

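Next define a vector $\kappa_n = \{p'_{\lambda_1}(|\beta_{10}|)\operatorname{sgn}(\beta_{10}), \ldots, p'_{\lambda_r}(|\beta_{r0}|)\operatorname{sgn}(\beta_{r0})\}^\top$ and a diagonal
matrix $\Sigma_\lambda = \operatorname{diag}\{p''_{\lambda_1}(|\beta_{10}|), \ldots, p''_{\lambda_r}(|\beta_{r0}|)\}$. We further denote $\Sigma_{1i} = \operatorname{Var}(Y_i \mid X_{1i}, Z_i)$,
$\Sigma_1 = \operatorname{diag}(\Sigma_{11}, \ldots, \Sigma_{1n})$, $A_1(V) = E(n^{-1}\tilde X_1^\top V^{-1}\tilde X_1)$ and $B_1(V, \Sigma_1) = E(n^{-1}\tilde X_1^\top V^{-1}\Sigma_1V^{-1}\tilde X_1)$.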

The theorem below shows that under regularity conditions, all the covariates with 
zero coefficients can be detected simultaneously with probability tending to 1, and the 
estimators of all the non-zero coefficients are asymptotically normally distributed. 

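Theorem 4. Under Assumptions (A1)-(A5), if $\lim_{n\to\infty}\sqrt{n}\lambda_{kn} = \infty$ and

$$\liminf_{n\to\infty}\ \liminf_{\beta_k\to 0^+}\ \lambda_{kn}^{-1}p'_{\lambda_{kn}}(|\beta_k|) > 0,$$

then the $\sqrt{n}$-consistent estimator $\hat\beta^{\mathcal P}$ in Theorem 3 satisfies $P(\hat{\boldsymbol\beta}_2^{\mathcal P} = 0) \to 1$, as $n \to \infty$,
and

$$\sqrt{n}\{A_1(V) + \Sigma_\lambda\}[\hat{\boldsymbol\beta}_1^{\mathcal P} - \boldsymbol\beta_{10} + \{A_1(V) + \Sigma_\lambda\}^{-1}\kappa_n] \rightarrow N(0, B_1(V, \Sigma_1)).$$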



4. Implementation 

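In this section, we illustrate how to implement the proposed method in semiparametric
marginal estimation and variable selection. Let

$$\Sigma_\lambda(\beta) = \operatorname{diag}\left\{\frac{p'_{\lambda_1}(|\beta_1|)}{\epsilon + |\beta_1|}, \ldots, \frac{p'_{\lambda_{d_1}}(|\beta_{d_1}|)}{\epsilon + |\beta_{d_1}|}\right\}$$

for a small number $\epsilon$ (a fixed power of 10 in our simulation studies). Applying the usual Taylor
approximation, $Q_{\mathcal P}(\beta)$ can be locally approximated around a value $\beta_0$ by

$$Q(\beta_0) + \dot Q(\beta_0)^\top(\beta - \beta_0) + \frac{1}{2}(\beta - \beta_0)^\top\ddot Q(\beta_0)(\beta - \beta_0) + \frac{1}{2}n_T\beta^\top\Sigma_\lambda(\beta_0)\beta.$$

By the local quadratic approximations for penalty functions (Fan and Li [7], Section 3.3),
the solution can be found iteratively:

$$\beta^{(k+1)} = \left[\sum_{i=1}^n\hat X_i^\top V_i^{-1}\hat X_i + n_T\Sigma_\lambda\{\beta^{(k)}\}\right]^{-1}\sum_{i=1}^n\hat X_i^\top V_i^{-1}(Y_i - \Pi_nY_i),$$

where $\Pi_nY_i$ is the projection of $Y_i$ onto the spline space $G_n$, and $\hat X_i$ is given in (16).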
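The local quadratic approximation step can be illustrated on a stripped-down problem. The sketch below is not the paper's full procedure: it assumes working independence ($V_i = I$), drops the spline projection (so plain covariates and responses are used in place of the projected quantities), and uses a placeholder `eps = 1e-6` together with a hypothetical final cut-off `zero_tol`. All function names are illustrative.

```python
import random


def scad_derivative(b, lam, a=3.7):
    """SCAD derivative p'_lam(b) for b >= 0, with a = 3.7 as suggested in the text."""
    return lam if b <= lam else max(a * lam - b, 0.0) / (a - 1.0)


def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def lqa_scad(X, y, lam, eps=1e-6, max_iter=100, tol=1e-10, zero_tol=1e-4):
    """Iterate beta <- (X'X + n * Sigma_lam(beta))^{-1} X'y, starting from least squares."""
    n, d = len(X), len(X[0])
    XtX = [[sum(X[i][j] * X[i][k] for i in range(n)) for k in range(d)] for j in range(d)]
    Xty = [sum(X[i][j] * y[i] for i in range(n)) for j in range(d)]
    beta = solve(XtX, Xty)  # unpenalized starting value
    for _ in range(max_iter):
        # Diagonal of Sigma_lam(beta): p'(|beta_k|) / (eps + |beta_k|).
        w = [scad_derivative(abs(b), lam) / (eps + abs(b)) for b in beta]
        A = [[XtX[j][k] + (n * w[j] if j == k else 0.0) for k in range(d)] for j in range(d)]
        new = solve(A, Xty)
        converged = max(abs(u - v) for u, v in zip(new, beta)) < tol
        beta = new
        if converged:
            break
    # Coefficients driven to (numerically) zero are reported as exact zeros.
    return [0.0 if abs(b) < zero_tol else b for b in beta]


def demo(seed=1, n=200):
    """Toy data with true coefficients (3, 0, 2) and independent N(0, 0.25) errors."""
    rng = random.Random(seed)
    true_beta = [3.0, 0.0, 2.0]
    X = [[rng.gauss(0, 1) for _ in true_beta] for _ in range(n)]
    y = [sum(b * x for b, x in zip(true_beta, row)) + rng.gauss(0, 0.5) for row in X]
    return lqa_scad(X, y, lam=0.3)
```

In this toy setting the large coefficients lie beyond $a\lambda$, where the SCAD derivative vanishes, so they are left essentially at their least squares values, while the small coefficient receives an ever larger ridge weight and collapses to zero.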

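Following Fan and Li [7], we derive a sandwich formula for the standard errors of the
estimated coefficients $\hat\beta^{\mathcal P}$:

$$\widehat{\operatorname{Cov}}(\hat\beta^{\mathcal P}) = \{\ddot Q(\hat\beta^{\mathcal P}) + n_T\Sigma_\lambda(\hat\beta^{\mathcal P})\}^{-1}\,\widehat{\operatorname{Cov}}\{\dot Q(\hat\beta^{\mathcal P})\}\,\{\ddot Q(\hat\beta^{\mathcal P}) + n_T\Sigma_\lambda(\hat\beta^{\mathcal P})\}^{-1}, \tag{17}$$

where $\ddot Q(\beta) = \sum_{i=1}^n\hat X_i^\top V_i^{-1}\hat X_i$ and $\widehat{\operatorname{Cov}}\{\dot Q(\beta)\} = \sum_{i=1}^n\hat X_i^\top V_i^{-1}\hat\Sigma_iV_i^{-1}\hat X_i$. Applying
conventional techniques that arise in the likelihood setting, we can show that the above
sandwich formula is a consistent estimator; it also has good accuracy in our simulation study
for moderate sample sizes.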

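We use BIC to select the tuning parameters $\lambda = (\lambda_1, \ldots, \lambda_{d_1})$. Let

$$e(\lambda) = \operatorname{tr}[\{\ddot Q(\hat\beta^{\mathcal P}) + n_T\Sigma_\lambda(\hat\beta^{\mathcal P})\}^{-1}\ddot Q(\hat\beta^{\mathcal P})]$$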

be the effective number of parameters in the last step of the Newton-Raphson iteration. 
Then 

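$$\mathrm{BIC}(\lambda) = \log\left\{\frac{1}{n_T}\sum_{i=1}^n(Y_i - \hat\mu_i)^\top R_i^{-1}(Y_i - \hat\mu_i)\right\} + \frac{\log(n_T)}{n_T}e(\lambda).$$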

The minimization problem over a $d_1$-dimensional space is difficult. However, Li and
Liang [16] conjectured that the magnitude of $\lambda_k$ should be proportional to the standard
error of $\hat\beta_k$. So we suggest taking $\lambda_k = \lambda \cdot \mathrm{SE}(\hat\beta_k)$ in practice, where $\mathrm{SE}(\hat\beta_k)$ is the
standard error of $\hat\beta_k$, the unpenalized estimator defined above. Thus, the minimization
problem can be reduced to a one-dimensional problem, and the tuning parameter can be
estimated by a grid search.
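The reduction to a one-dimensional search can be sketched as follows. The helper `select_lambda` and its argument `bic` are hypothetical: `bic` stands for any user-supplied function that refits the penalized model for a given vector of tuning parameters and returns the BIC value; it is not implemented here.

```python
def select_lambda(standard_errors, bic, grid):
    """Pick the scalar lam on `grid` minimizing bic([lam * se_k for each k]).

    standard_errors: standard errors of the unpenalized estimates.
    bic: callable mapping a list of lambda_k values to a criterion value (assumed given).
    grid: candidate values of the scalar lam.
    """
    best_lam, best_value = None, float("inf")
    for lam in grid:
        lambdas = [lam * se for se in standard_errors]
        value = bic(lambdas)
        if value < best_value:
            best_lam, best_value = lam, value
    return best_lam, [best_lam * se for se in standard_errors]


# Toy criterion used only to exercise the search; a real `bic` would refit the model.
toy_bic = lambda lambdas: (lambdas[0] - 0.2) ** 2
lam_hat, lambdas_hat = select_lambda([0.1, 0.2], toy_bic, [0.5 * k for k in range(1, 11)])
```

Only the scalar `lam` is searched over; the component-wise tuning parameters follow from the fixed standard errors.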

5. Simulation 

In this section, we discuss finite sample properties of the proposed estimators via
simulation studies. We simulated 100 data sets of size $n = 100$, 200 and 400 from the model

$$Y_{ij} = \beta^\top X_{ij} + \eta_1(Z_{ij1}) + \eta_2(Z_{ij2}) + \varepsilon_{ij}, \qquad i = 1, \ldots, n,\ j = 1, \ldots, 3, \tag{18}$$

where the coefficients $\beta = (3, 1.5, 0, 0, 2, 0, 0, 0)^\top$, function $\eta_1(z) = \sin 2\pi(z - 0.5)$ and
function $\eta_2(z) = z - 0.5 + \sin\{2\pi(z - 0.5)\}$.

The 2-vector $Z_{ij}$ was generated from a bivariate normal distribution with mean 0, a
common marginal variance 0.25 and correlation 0.9, but truncated to the unit square
$[0, 1]^2$. The covariates $X_{ijk}$, $k = 1, \ldots, 6$, were generated independently from $N(0, 0.25)$.
Covariate $X_{ij7} = 3(1 - 2Z_{ij1})(1 - 2Z_{ij2}) + u_{ij}$, where $u_{ij} \sim N(0, 0.25)$ and is independent
of $Z_{ij}$. Covariate $X_{ij8}$ was generated as $-0.5$ and $0.5$ with equal probability. We generated
$\varepsilon_i = (\varepsilon_{i1}, \varepsilon_{i2}, \varepsilon_{i3})$ from $N(0, \Sigma_\varepsilon)$, where $\Sigma_\varepsilon = (1 - \alpha)I + \alpha\mathbf 1\mathbf 1^\top$ with $\mathbf 1$ being a vector of
ones and $\alpha = 0.9$, that is, $\Sigma_\varepsilon$ is exchangeable.
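The data-generating mechanism of model (18) can be sketched with the standard library alone. This is an illustration under stated assumptions: $N(0, 0.25)$ is read as variance 0.25 (standard deviation 0.5), the truncation to the unit square is done by rejection sampling, and the exchangeable errors are built from a shared cluster-level term. The function names are hypothetical.

```python
import math
import random

BETA = [3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0]


def eta1(z):
    return math.sin(2 * math.pi * (z - 0.5))


def eta2(z):
    return z - 0.5 + math.sin(2 * math.pi * (z - 0.5))


def draw_z(rng, sd=0.5, rho=0.9):
    """Bivariate normal (mean 0, variance sd^2, correlation rho) truncated to [0, 1]^2.

    Draws are repeated until both coordinates fall in the unit square.
    """
    while True:
        u, v = rng.gauss(0, 1), rng.gauss(0, 1)
        z1 = sd * u
        z2 = sd * (rho * u + math.sqrt(1 - rho ** 2) * v)
        if 0 <= z1 <= 1 and 0 <= z2 <= 1:
            return z1, z2


def simulate_cluster(rng, m=3, alpha=0.9):
    """One cluster of m observations (y, x, z) from model (18)."""
    shared = rng.gauss(0, 1)  # cluster-level term giving correlation alpha within the cluster
    rows = []
    for _ in range(m):
        z1, z2 = draw_z(rng)
        x = [rng.gauss(0, 0.5) for _ in range(6)]                        # X_1, ..., X_6
        x.append(3 * (1 - 2 * z1) * (1 - 2 * z2) + rng.gauss(0, 0.5))    # X_7 depends on Z
        x.append(rng.choice([-0.5, 0.5]))                                # X_8 is binary
        # Variance alpha + (1 - alpha) = 1, within-cluster covariance alpha.
        eps = math.sqrt(alpha) * shared + math.sqrt(1 - alpha) * rng.gauss(0, 1)
        y = sum(b * xk for b, xk in zip(BETA, x)) + eta1(z1) + eta2(z2) + eps
        rows.append((y, x, (z1, z2)))
    return rows


def simulate(n, seed=0):
    """Simulate n clusters with a reproducible random stream."""
    rng = random.Random(seed)
    return [simulate_cluster(rng) for _ in range(n)]


data = simulate(100)
```

Each element of `data` is a list of three `(y, x, z)` tuples, one per repeated measurement.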

Cubic B-splines were used to approximate the nonparametric functions as described in
Section 2. We tried different numbers of knots (ranging from 2 to 10) and found that the
choice of the number of knots did not make a significant difference in this simulation study.
Our reported results in Tables 1 and 2 were based on using 4 equally spaced knots.

To the simulated data sets, we applied the proposed method for estimation and variable
selection. To study how the structure of the working correlation could affect our
estimation and variable selection results, we considered the following three correlation
structures: the correct exchangeable working correlation structure (EX), working
independence (WI) and AR(1) structures. Table 1 summarizes the estimation and variable
selection results with two types of penalty functions: SCAD and HARD. The average
number of zero coefficients is reported in Table 1, in which the column labeled "C"
presents the average restricted only to the true zero coefficients, and the column labeled
"I" shows the average number of coefficients erroneously set to zero. The rows with "SCAD" and
"HARD" stand, respectively, for the penalized least squares with the SCAD and HARD
penalties. The oracle estimates always identify the 5 zero coefficients and 3 non-zero
coefficients correctly. The medians of relative model errors (MRME) as suggested in Fan
and Li [7] and the root mean squared errors (RMSE) of the estimated coefficients over
100 simulated data sets are also reported in Table 1.

Table 1. Model selection and estimation: the average number of correct (C) and incorrect (I)
0s, MRME (%) and RMSE

                     EX                            AR(1)                         WI
n    Penalty   C     I     MRME    RMSE      C     I     MRME    RMSE      C     I     MRME    RMSE
100  SCAD      4.67  0     80.63   0.1592    4.64  0     84.65   0.1727    4.64  0     82.38   0.5883
     HARD      4.80  0     85.90   0.1691    4.70  0     86.56   0.1916    4.85  0     85.81   0.4410
     ORACLE    5.00  0     77.23   0.1586    5.00  0     73.40   0.1723    5.00  0     70.71   0.4126
200  SCAD      4.72  0     76.30   0.1053    4.72  0     81.63   0.1127    4.70  0     79.99   0.3921
     HARD      4.79  0     82.81   0.1116    4.71  0     82.18   0.1252    4.98  0     86.15   0.2816
     ORACLE    5.00  0     66.96   0.1038    5.00  0     66.18   0.1110    5.00  0     70.86   0.2787
400  SCAD      4.92  0     84.91   0.0733    4.86  0     84.50   0.0864    4.88  0     85.78   0.2689
     HARD      4.93  0     91.23   0.0758    4.87  0     85.65   0.0924    4.90  0     91.08   0.2021
     ORACLE    5.00  0     68.33   0.0731    5.00  0     66.91   0.0860    5.00  0     71.52   0.1857

From Table 1, one sees that the choice of correlation structure has little impact on the
results of variable selection: the numbers of correctly identified zero coefficients are all
close to 5 regardless of the correlation structure, and none of the non-zero coefficients was
erroneously set to 0 in any scenario. Table 1 also shows that the estimators with the correct
working correlation have the smallest RMSEs, and thus are more efficient than the
estimators with misspecified working correlation structures. The efficiency of the estimators
based on AR(1) is close to that of the estimators based on EX, but there is a noticeable
loss of efficiency for the estimators based on the WI structure, which ignores the
within subject/cluster correlation. In terms of choosing penalty functions, we find that
both HARD and SCAD perform very well and the corresponding MRME and RMSE are
comparable to those of the ORACLE.

We also tested the accuracy of our standard error formula based on (17). The median
absolute deviation (MAD) divided by 0.6745 (denoted by SD in Table 2) of the 100 estimated
coefficients from the 100 simulations can be regarded as the true standard error. The
median of the 100 estimated SDs (denoted by SD_m) and the MAD of the 100 estimated
standard errors divided by 0.6745 (denoted by SD_mad) gauge the overall performance of
the standard error formula. Table 2 presents the standard errors for the non-zero coefficients when
the sample size n = 100, 200 and 400. It suggests that the sandwich formula performs
satisfactorily for the SCAD and HARD penalties. The standard errors based on the SCAD
and HARD penalty functions are closer to those of the ORACLE as n increases. Similarly
to the RMSE results shown in Table 1, Table 2 also shows that the estimation procedures
with a correct EX working correlation are more efficient than their counterparts with WI
working correlation. Estimation based on a misspecified AR(1) correlation structure
leads to some efficiency loss, but it is quite close to using the true EX structure.

Table 2. Simulation results on standard error estimation for the non-zero coefficients
(beta_1, beta_2, beta_5)

                       beta_1                     beta_2                     beta_5
n    Penalty   SD      SD_m    SD_mad     SD      SD_m    SD_mad     SD      SD_m    SD_mad

EX
100  SCAD      0.0889  0.0906  0.0141     0.1082  0.0911  0.0116     0.1034  0.0894  0.0111
     HARD      0.0879  0.0907  0.0118     0.1102  0.0911  0.0113     0.0982  0.0897  0.0108
     ORACLE    0.0866  0.0988  0.0062     0.1066  0.0899  0.0112     0.1012  0.0903  0.0088
200  SCAD      0.0655  0.0638  0.0035     0.0616  0.0629  0.0036     0.0594  0.0633  0.0036
     HARD      0.0655  0.0637  0.0033     0.0627  0.0630  0.0033     0.0600  0.0632  0.0035
     ORACLE    0.0648  0.0699  0.0078     0.0614  0.0637  0.0034     0.0594  0.0629  0.0036
400  SCAD      0.0414  0.0445  0.0043     0.0379  0.0445  0.0041     0.0415  0.0449  0.0042
     HARD      0.0414  0.0446  0.0043     0.0373  0.0445  0.0041     0.0418  0.0449  0.0042
     ORACLE    0.0412  0.0485  0.0086     0.0368  0.0443  0.0049     0.0404  0.0445  0.0046

AR(1)
100  SCAD      0.0983  0.0923  0.0141     0.1035  0.0940  0.0129     0.1153  0.0924  0.0132
     HARD      0.0996  0.0920  0.0160     0.1073  0.0939  0.0130     0.1117  0.0924  0.0128
     ORACLE    0.0976  0.0972  0.0097     0.0971  0.0915  0.0124     0.1173  0.0930  0.0122
200  SCAD      0.0635  0.0646  0.0041     0.0539  0.0634  0.0047     0.0647  0.0639  0.0045
     HARD      0.0632  0.0645  0.0045     0.0544  0.0634  0.0044     0.0626  0.0639  0.0045
     ORACLE    0.0624  0.0689  0.0073     0.0535  0.0646  0.0049     0.0657  0.0635  0.0048
400  SCAD      0.0452  0.0448  0.0056     0.0390  0.0451  0.0055     0.0535  0.0452  0.0052
     HARD      0.0451  0.0449  0.0057     0.0390  0.0451  0.0055     0.0537  0.0452  0.0053
     ORACLE    0.0454  0.0477  0.0061     0.0392  0.0448  0.0051     0.0539  0.0450  0.0056

WI
100  SCAD      0.2177  0.2364  0.0164     0.2192  0.2381  0.0187     0.2341  0.2375  0.0189
     HARD      0.2235  0.2364  0.0151     0.2185  0.2396  0.0169     0.2393  0.2389  0.0193
     ORACLE    0.2239  0.0579  0.1623     0.2118  0.2341  0.0165     0.2218  0.2374  0.0181
200  SCAD      0.1864  0.1674  0.0204     0.1697  0.1677  0.0188     0.1328  0.1671  0.0186
     HARD      0.1876  0.1675  0.0201     0.1676  0.1680  0.0180     0.1385  0.1676  0.0181
     ORACLE    0.1836  0.0415  0.1242     0.1656  0.1669  0.0162     0.1356  0.1671  0.0165
400  SCAD      0.0957  0.1162  0.0131     0.1055  0.1163  0.0117     0.1252  0.1164  0.0131
     HARD      0.0956  0.1162  0.0129     0.1042  0.1165  0.0121     0.1222  0.1163  0.0128
     ORACLE    0.0956  0.0289  0.0778     0.1069  0.1160  0.0110     0.1227  0.1162  0.0104
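The three summaries reported in Table 2 are simple robust statistics. A minimal sketch follows, assuming one has, for a single coefficient, the list of estimates and the list of sandwich standard errors across the simulated data sets; the function names are illustrative.

```python
import statistics


def robust_sd(values):
    """Median absolute deviation divided by 0.6745, a robust estimate of the standard deviation."""
    values = list(values)
    center = statistics.median(values)
    mad = statistics.median([abs(v - center) for v in values])
    return mad / 0.6745


def standard_error_summary(estimates, estimated_ses):
    """Return (SD, SD_m, SD_mad) for one coefficient across simulation replicates."""
    sd = robust_sd(estimates)                  # treated as the true standard error
    sd_m = statistics.median(estimated_ses)    # typical value of the sandwich standard error
    sd_mad = robust_sd(estimated_ses)          # variability of the sandwich standard error
    return sd, sd_m, sd_mad
```

The sandwich formula is judged to work well when `sd_m` is close to `sd` and `sd_mad` is small.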

6. Application 

To illustrate our method, we considered the longitudinal CD4 cell count data among HIV
seroconverters. This dataset contains 2376 observations of CD4 cell counts on 369 men
infected with HIV; see Zeger and Diggle [42] for a detailed description of this
dataset. Both Wang, Carroll and Lin [35] and Huang, Zhang and Zhou [14] analyzed the
same dataset using a PLM. Their analysis aimed to estimate the average time course of
CD4 counts and the effects of other covariates. In our analysis, we fit the data using an
APLM, with the square root transformed CD4 counts as the response, and covariates
including AGE, SMOKE (smoking status measured by packs of cigarettes), DRUG (yes,
1; no, 0), SEXP (number of sex partners), DEPRESSION (measured by the CESD scale)
and YEAR (the effect of time since seroconversion). To take advantage of the flexibility
of partially linear additive models, we let both DEPRESSION and YEAR be modeled
nonparametrically, and the remaining covariates parametrically. It is of interest to examine whether
there are any interaction effects between the parametric covariates, so we included all
these interactions in the parametric part.

For the working covariance, we considered the WI, the AR(1) and the "random intercept
plus serial correlation and measurement error" covariance (RSM) in Zeger and Diggle [42].
One can obtain the RSM structure by fitting a full model to the data and inspecting the
variogram of the residuals. Wang, Carroll and Lin [35] and Huang, Zhang and Zhou [14]
also analyzed this data set using the RSM structure. More precisely, the working
covariance matrices are specified by $\tau^2I + \nu^2J + \omega^2H$, where $I$ is an identity matrix, $J$ is a
matrix of 1s and $H(j, j') = \exp(-\alpha|\mathrm{YEAR}_{ij} - \mathrm{YEAR}_{ij'}|)$. We used the covariance
parameters $(\tau^2, \nu^2, \omega^2, \alpha) = (11.32, 3.26, 22.15, 0.23)$ calculated by Wang et al. [35]. Table 3
gives the estimates of the regression coefficients using WI, AR(1) and RSM covariance
structures. The standard errors (SE) were all calculated using the sandwich method. We
used cubic splines with 4 knots, selected by five-fold delete-subject-out cross-validation
from the range of 0-20. We refer the reader to Huang, Wu and Zhou [13] for the details
of the delete-subject-out $K$-fold cross-validation. The left panel of Table 3 reports the
estimation using the full model, and the selection results are shown in the right panel.
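The RSM working covariance for one subject can be assembled directly from the observation times. The sketch below assumes the exponential serial correlation $H(j, j') = \exp(-\alpha|t_j - t_{j'}|)$ in the observation times and uses the parameter values quoted above as defaults; the function name `rsm_covariance` and the example times are hypothetical.

```python
import math


def rsm_covariance(years, tau2=11.32, nu2=3.26, omega2=22.15, alpha=0.23):
    """Working covariance tau2 * I + nu2 * J + omega2 * H for one subject.

    years: observation times of the subject (time since seroconversion).
    I is the identity (measurement error), J the all-ones matrix (random intercept),
    and H[j][k] = exp(-alpha * |years[j] - years[k]|) the serial correlation.
    """
    m = len(years)
    cov = []
    for j in range(m):
        row = []
        for k in range(m):
            serial = math.exp(-alpha * abs(years[j] - years[k]))
            row.append((tau2 if j == k else 0.0) + nu2 + omega2 * serial)
        cov.append(row)
    return cov


# Example with three visits at illustrative times (in years).
V = rsm_covariance([0.0, 1.0, 2.5])
```

Because the number and timing of visits differ across subjects, each subject gets its own matrix of matching dimension.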

We further applied the proposed approach to select significant variables. We used the 
SCAD penalty with tuning parameter $\lambda$ = 0.4549, 0.2829 and 0.3143 for the WI, AR(1) 
and RSM covariance structures, respectively. The results are also shown in Table 3. Under 
both the WI and RSM structures, SMOKE, DRUGS, SEXP, SMOKE*SEXP and DRUGS*SEXP 
are identified as significant covariates. One notes a slight difference in selection when 
the AR(1) structure is used, which suggests that SMOKE*DRUGS may also be significant. 
Although the selection procedure is not sensitive to the choice of covariance structure, 
as shown in our simulation study, different covariance structures may still lead to slightly 
different results. Therefore, it is important to choose a covariance structure close 
to the true one. We also find some significant interactions among the covariates, which 
may have been ignored by Wang, Carroll and Lin [35] and Huang, Zhang and Zhou [14]. 
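
For reference, the SCAD penalty used in the selection step can be sketched as follows. This is a generic illustration of the penalty introduced by Fan and Li (2001), not the code used to produce Table 3: the function names are hypothetical, and the shape constant a = 3.7 is the default suggested by Fan and Li, assumed here because this section does not state the value that was used.

```python
def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty p_lambda(|theta|); a = 3.7 is an assumed default."""
    t = abs(theta)
    if t <= lam:
        # Linear (lasso-like) part near zero.
        return lam * t
    if t <= a * lam:
        # Quadratic transition between the linear and the flat part.
        return -(t * t - 2.0 * a * lam * t + lam * lam) / (2.0 * (a - 1.0))
    # Constant part: large coefficients receive no additional penalty.
    return (a + 1.0) * lam * lam / 2.0

def scad_derivative(theta, lam, a=3.7):
    """Derivative p'_lambda(|theta|) with respect to |theta|."""
    t = abs(theta)
    if t <= lam:
        return lam
    return max(a * lam - t, 0.0) / (a - 1.0)

# Tuning parameter reported in the text for the WI structure.
lam_wi = 0.4549
for coef in (0.05, 0.5, 1.0, 2.5):
    print(coef, scad_penalty(coef, lam_wi), scad_derivative(coef, lam_wi))
```

The derivative is constant for small coefficients, which produces sparsity, and vanishes for coefficients larger than a times the tuning parameter, which is why large estimates such as the INTERCEPT are left essentially unshrunk.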



Table 3. Estimated coefficients for CD4 dataset. Entries are $\hat\beta$ (SE($\hat\beta$)) 

            Full                                               Penalized
Variable    WI               AR(1)            RSM              WI               AR(1)            RSM
INTERCEPT   24.365 (0.417)   24.540 (0.480)   24.819 (0.494)   24.487 (0.391)   24.454 (0.464)   24.793 (0.461)
AGE         -0.013 (0.035)   -0.023 (0.045)   -0.049 (0.049)   0 (0)            0 (0)            0 (0)
SMOKE       1.070 (0.234)    0.825 (0.259)    0.654 (0.264)    0.733 (0.148)    0.824 (0.247)    0.424 (0.176)
DRUG        2.671 (0.491)    1.958 (0.517)    1.468 (0.507)    2.486 (0.454)    2.025 (0.511)    1.340 (0.462)
SEXP        0.165 (0.082)    0.153 (0.084)    0.109 (0.082)    0.170 (0.076)    0.174 (0.080)    0.144 (0.078)
AGE*SMOKE   -0.014 (0.011)   0.000 (0.015)    0.002 (0.017)    0 (0)            0 (0)            0 (0)
AGE*DRUG    0.043 (0.036)    0.008 (0.042)    0.010 (0.044)    0 (0)            0 (0)            0 (0)
AGE*SEXP    0.001 (0.005)    0.005 (0.005)    0.008 (0.005)    0 (0)            0 (0)            0 (0)
SMOKE*DRUG  -0.402 (0.233)   -0.331 (0.246)   -0.288 (0.248)   0 (0)            -0.337 (0.241)   0 (0)
SMOKE*SEXP  0.058 (0.024)    0.043 (0.026)    0.047 (0.025)    0.051 (0.023)    0.045 (0.025)    0.046 (0.025)
DRUG*SEXP   -0.364 (0.087)   -0.251 (0.086)   -0.17 (0.083)    -0.355 (0.084)   -0.265 (0.084)   -0.186 (0.081)

Figure 1. The estimates of the nonparametric components $\eta_1$ and $\eta_2$ (panels: CESD 
and YEAR). The solid, dotted and dashed curves correspond to the estimates under the WI, 
AR(1) and RSM structures. 

The nonparametric curve estimates using the WI (solid line), AR(1) (dotted line) and 
RSM (dashed line) estimators are plotted in Figure 1 for "DEPRESSION" and "YEAR." 
One can see that it is more reasonable to put "DEPRESSION" as a nonparametric 
component. 

7. Discussion 

We have developed a general methodology for simultaneously selecting variables and 
estimating the unknown components in APLMs for longitudinal and clustered data. We 
propose a one-step least squares approach to obtain the estimation of both the parametric 
and nonparametric components based on polynomial spline smoothing. This approach is 
flexible, computationally simple and very easy to implement in practice. We demonstrate 
that the asymptotic normality of the estimated coefficients for the linear part is retained. 
The proposed penalized regression method also achieves an "oracle" property in the sense 
that it performs as well as if the subset of significant parametric components were known 
in advance. 

In this paper, our primary interest is in the linear components, and we treat the 
nonparametric functions as nuisance components; thus we limit our discussion to estimation 
and variable selection for the linear part. Nonetheless, the approach may be extended to 
the nonparametric components using techniques similar to those in Xue [37]. An anonymous 
referee pointed out the feasibility of obtaining the asymptotic "oracle" property of the 
nonparametric components in Ma and Yang [24]. We believe that this property can be 
similarly obtained via a two-step spline backfitted kernel smoothing procedure (Ma and 
Yang [24]). However, the technical details deserve careful consideration, and this is an 
interesting topic for future research. 

The simulation results indicate that the variable selection is consistent even if the 
correlation structure is misspecified. However, misspecification may lead to some loss of 
efficiency: the simulation results clearly show a marked improvement in efficiency when 
one uses the correct correlation structure. It would therefore be desirable to choose an 
appropriate correlation structure based on the available data in practice. To select 
the correlation matrix, one might consider some resampling-based methods, such as the 
bootstrap and cross-validation methods in Pan and Connett [28] and other techniques in 
Diggle et al. [5]. There is, however, a clear need to formalize the procedures with solid 
theoretical justification. Instead of modeling the correlation through the "working" 
correlation matrix, one could also nonparametrically model the variance-covariance as some 
unknown smooth function (Chiou and Müller [4]). This is an excellent research problem 
for future study. 

Appendix 

For any vector $x = (x_1, \ldots, x_s)^T$, we denote by $\|\cdot\|$ the usual Euclidean norm, 
that is, $\|x\| = \sqrt{\sum_{k=1}^{s} x_k^2}$, and by $\|\cdot\|_\infty$ the sup norm, that is, 
$\|x\|_\infty = \sup_{1 \le k \le s} |x_k|$. For any functions $\phi, \varphi$, let 
$\phi(X_i, Z_i)$ and $\varphi(X_i, Z_i)$ be $m_i$-vectors; then define the empirical inner 
product and the empirical norm as 

$$\langle \phi, \varphi \rangle_n = \langle \phi, \varphi \rangle_{n,V} = n^{-1} \sum_{i=1}^{n} \phi(X_i, Z_i)^T V_i^{-1} \varphi(X_i, Z_i), \qquad \|\phi\|_n^2 = \langle \phi, \phi \rangle_n,$$

for the working covariance $V_i$. Further denote 
$E_n(\phi) = n^{-1} \sum_{i=1}^{n} V_i^{-1} \phi(X_i, Z_i)$. If the functions $\phi, \varphi$ are 
$L^2$-integrable, we define the theoretical inner product and its corresponding theoretical 
norm as $\langle \phi, \varphi \rangle = E(\langle \phi, \varphi \rangle_n)$ and 
$\|\phi\|^2 = E(\|\phi\|_n^2)$. Let $\Pi_n$ and $\Pi$ denote, respectively, the projections 
relative to the empirical and theoretical inner products. For convenience, let 
$h = h_n \asymp J_n^{-1}$ and let $I_d$ be the $d \times d$ identity matrix. 

A.1. Proof of Theorem 1 

Lemma A.1. Define 




A„ = sup |(5i,.92>,i - (gi,g2)|||.9i|| ^||.92| 



1 



si.S2eG! 



Bn= max sup ll|.Tfc-.gl|yi|2;fe -.911^-1], 



l<fe<dl g£G\\ 



then $A_n = O_p\{\sqrt{\log(n)/(nh^2)}\}$ and $B_n = O_p\{\sqrt{\log(n)/(nh^2)}\}$. 

Lemma A.1 can be proved similarly to Lemmas A2 and A3 in Huang, Zhang and 
Zhou [14], and the proof is thus omitted. 

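To obtain the closed-form expression of $\hat\beta$, we need the following block form of the 
inverse of $\sum_{i=1}^{n} (X_i, B_i)^T V_i^{-1} (X_i, B_i)$: 

$$
\begin{pmatrix}
\sum_{i=1}^{n} X_i^T V_i^{-1} X_i & \sum_{i=1}^{n} X_i^T V_i^{-1} B_i \\
\sum_{i=1}^{n} B_i^T V_i^{-1} X_i & \sum_{i=1}^{n} B_i^T V_i^{-1} B_i
\end{pmatrix}^{-1}
=
\begin{pmatrix} H_{XX} & H_{XB} \\ H_{BX} & H_{BB} \end{pmatrix}^{-1}
=
\begin{pmatrix} H^{11} & H^{12} \\ H^{21} & H^{22} \end{pmatrix},
\tag{A.1}
$$

where $H^{11} = (H_{XX} - H_{XB} H_{BB}^{-1} H_{BX})^{-1}$, 
$H^{22} = (H_{BB} - H_{BX} H_{XX}^{-1} H_{XB})^{-1}$, 
$H^{12} = -H^{11} H_{XB} H_{BB}^{-1}$ and $H^{21} = -H^{22} H_{BX} H_{XX}^{-1}$. Consequently, 

$$
\hat\beta = H^{11} \Big\{ \sum_{i=1}^{n} X_i^T V_i^{-1} Y_i - H_{XB} H_{BB}^{-1} \sum_{i=1}^{n} B_i^T V_i^{-1} Y_i \Big\}.
\tag{A.2}
$$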
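
The partitioned form in (A.2) can be checked numerically in the simplest possible setting. The sketch below is an illustration under simplifying assumptions that are not in the paper: a single parametric column, a single spline-basis column and a diagonal working covariance, so that every block of (A.1) is a scalar. The function names are hypothetical. It compares the partitioned expression with the first coefficient of the joint weighted least squares fit.

```python
def _weighted_sums(x, b, y, w):
    """Scalar versions of H_XX, H_XB, H_BB and the two cross-products with y."""
    hxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    hxb = sum(wi * xi * bi for wi, xi, bi in zip(w, x, b))
    hbb = sum(wi * bi * bi for wi, bi in zip(w, b))
    xy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    by = sum(wi * bi * yi for wi, bi, yi in zip(w, b, y))
    return hxx, hxb, hbb, xy, by

def partitioned_beta(x, b, y, w):
    """beta-hat via the partitioned form: H^11 {X'WY - H_XB H_BB^-1 B'WY}."""
    hxx, hxb, hbb, xy, by = _weighted_sums(x, b, y, w)
    h11 = 1.0 / (hxx - hxb * hxb / hbb)
    return h11 * (xy - hxb / hbb * by)

def joint_beta(x, b, y, w):
    """First coefficient of the joint weighted fit, solved by Cramer's rule."""
    hxx, hxb, hbb, xy, by = _weighted_sums(x, b, y, w)
    det = hxx * hbb - hxb * hxb
    return (xy * hbb - hxb * by) / det

# Toy data generated exactly from y = 2*x + 3*b, so both routes return 2.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [1.0, 0.0, 1.0, 0.0, 2.0]
w = [1.0, 1.0, 2.0, 2.0, 0.5]   # diagonal entries of the inverse working covariance
y = [2.0 * xi + 3.0 * bi for xi, bi in zip(x, b)]
print(partitioned_beta(x, b, y, w), joint_beta(x, b, y, w))
```

The agreement of the two functions is just the scalar case of the block-inverse identity; with several columns the same check would need matrix inverses in place of the scalar divisions.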

Lemma A.2. Under Assumptions (A1)-(A5), /or Hbb 'in (A.l), one has (i) there exist 
constants < ch < Ch , C'^ = , = such that 

$$c_H I_{d_2 J_n} \le E(n^{-1} H_{BB}) \le C_H I_{d_2 J_n}; \tag{A.3}$$

(ii) with probability approaching 1 as $n \to \infty$, 

$$c_H I_{d_2 J_n} \le n^{-1} H_{BB} \le C_H I_{d_2 J_n}. \tag{A.4}$$

Since the proof of Lemma A.2 is somewhat involved, we provide it in the supplemental 
article (Ma, Song and Wang [23]). The proofs of Lemmas A.3 to A.7 below are also 
provided in Ma, Song and Wang [23]. 

Lemma A.3. Define U= {Y.'Li'^i^i)d^ J„xdi, where B; is given in (3). Under As- 
sumptions (Al)-(A5), there exist constants Q < cjj < Cjj < oo, such that with probability 
approaching 1 as n—> oo, cijldi < (n^^/i)U'^U < Cijidi- 

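Lemma A.4. Under Assumptions (A1)-(A5), there exist constants $0 < c_{H_1} < C_{H_1} < \infty$ 
such that, with probability approaching 1 as $n \to \infty$, 
$c_{H_1} I_{d_1} \le n H^{11} \le C_{H_1} I_{d_1}$, where $H^{11}$ is given in (A.1). 

Let $\hat\beta_\mu$ and $\hat\beta_e$ be the solutions of (A.2) with $Y_i$ replaced by $\mu_i$ and 
$e_i = Y_i - \mu_i$, respectively. Then $\hat\beta - \beta_0 = (\hat\beta_\mu - \beta_0) + \hat\beta_e$. 

Lemma A.5. Under Assumptions (A1)-(A5), $\|\hat\beta_\mu - \beta_0\| = o_p(n^{-1/2})$. 

Note that 
$\hat\beta_e = H^{11} \{ \sum_{i=1}^{n} X_i^T V_i^{-1} e_i - H_{XB} H_{BB}^{-1} \sum_{i=1}^{n} B_i^T V_i^{-1} e_i \}$; 
thus we can show that the conditional variance $\mathrm{Var}(\hat\beta_e \mid X, Z)$ equals 

$$
H^{11} \sum_{i=1}^{n} \{ X_i - B_i H_{BB}^{-1} H_{BX} \}^T V_i^{-1} \Sigma_i V_i^{-1} \{ X_i - B_i H_{BB}^{-1} H_{BX} \} H^{11}.
\tag{A.5}
$$

Lemma A.6. Under Assumptions (A1)-(A5), as $n \to \infty$, 

$$\{ \mathrm{Var}(\hat\beta_e \mid X, Z) \}^{-1/2} \hat\beta_e \longrightarrow N(0, I_{d_1}).$$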

Lemma A.7. Under Assumptions (A1)-(A5), for the covariance matrix $\Omega(V, \Sigma)$ defined 
in (15), $c_\Omega I_{d_1} \le \Omega(V, \Sigma) \le C_\Omega I_{d_1}$ and 
$\mathrm{Var}(\hat\beta_e \mid X, Z) = n^{-1} \Omega(V, \Sigma) + O_p(n^{-3/2} + n^{-1} h^{2p})$. 

Theorem 1 follows from Lemmas A.5, A.6 and A. 7. 
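
The variance expression (A.5) has the familiar "bread-meat-bread" sandwich shape. As a rough illustration of how a sandwich standard error of the kind reported in Table 3 can be computed, the sketch below treats a single coefficient under working independence and replaces the unknown true covariance of each subject by the outer product of that subject's residuals. That plug-in is the usual sandwich substitution and is an assumption here, since (A.5) itself is stated with the true covariance; the function name and the toy numbers are likewise hypothetical.

```python
def sandwich_variance(clusters):
    """Cluster-robust variance of a scalar coefficient under working independence.

    clusters: list of (x_tilde, resid) pairs, one pair per subject. x_tilde holds
    the covariate values after the spline part has been projected out, and resid
    holds the fitted residuals for the same observations.
    """
    # Bread: scalar analogue of the inverse of H^11, the sum of squared x_tilde.
    bread = sum(x * x for x_tilde, _ in clusters for x in x_tilde)
    # Meat: per-subject score sums, squared, so within-subject correlation counts.
    meat = sum(
        sum(x * e for x, e in zip(x_tilde, resid)) ** 2
        for x_tilde, resid in clusters
    )
    return meat / bread ** 2

# Two toy subjects with two and one observations respectively (made-up numbers).
toy = [([1.0, -1.0], [0.5, 0.5]), ([2.0], [1.0])]
variance = sandwich_variance(toy)
print(variance, variance ** 0.5)
```

Because the squared sum is taken within each subject before adding across subjects, the estimate remains valid when the working covariance is misspecified, which is the property the standard errors in Table 3 rely on.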



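A.2. Proof of Theorem 2 

From (12) and (A.1), we obtain 

$$
\hat\gamma = H^{22} \Big\{ \sum_{i=1}^{n} B_i^T V_i^{-1} Y_i - H_{BX} H_{XX}^{-1} \sum_{i=1}^{n} X_i^T V_i^{-1} Y_i \Big\}.
\tag{A.6}
$$

Following the same idea as that in the proof of Lemma A.4, there exist constants 
$0 < c_{H_2} < C_{H_2} < \infty$ such that, with probability approaching 1 as $n \to \infty$, 
$c_{H_2} I_{d_2 J_n} \le n H^{22} \le C_{H_2} I_{d_2 J_n}$. Letting $\hat\gamma_\mu$ and $\hat\gamma_e$ be the 
solutions of (A.6) with $Y_i$ replaced by $\mu_i$ and $e_i = Y_i - \mu_i$, respectively, we have 
$\hat\gamma - \gamma = (\hat\gamma_\mu - \gamma) + \hat\gamma_e$. Letting $\Pi_{n,X}$ be the projection on 
$\{X_i\}_{i=1}^{n}$ relative to the empirical inner product, $\hat\gamma_\mu - \gamma$ equals 

$$
\begin{aligned}
& H^{22} \Big\{ \sum_{i=1}^{n} B_i^T V_i^{-1} \sum_{l=1}^{d_2} \eta_l(Z_{il}) - H_{BX} H_{XX}^{-1} \sum_{i=1}^{n} X_i^T V_i^{-1} \sum_{l=1}^{d_2} \eta_l(Z_{il}) \Big\} - \gamma \\
&\quad = H^{22} \sum_{i=1}^{n} B_i^T V_i^{-1} \Big[ \sum_{l=1}^{d_2} \eta_l(Z_{il}) - B_i \gamma - \Pi_{n,X} \Big\{ \sum_{l=1}^{d_2} \eta_l(Z_{il}) - B_i \gamma \Big\} \Big] = n H^{22} S,
\end{aligned}
$$

where $S = (S_{1,1}, \ldots, S_{J_n, d_2})^T$, with 

$$
S_{s,l} = n^{-1} \sum_{i=1}^{n} (B_i^{(s,l)})^T V_i^{-1} \Big[ \sum_{l'=1}^{d_2} \eta_{l'}(Z_{il'}) - B_i \gamma - \Pi_{n,X} \Big\{ \sum_{l'=1}^{d_2} \eta_{l'}(Z_{il'}) - B_i \gamma \Big\} \Big]
$$

and $B_i^{(s,l)} = [\{ B_{s,l}(Z_{i1l}), \ldots, B_{s,l}(Z_{i m_i l}) \}^T]_{m_i \times 1}$. Let 
$\Delta\eta(Z_i) = \sum_{l=1}^{d_2} \eta_l(Z_{il}) - B_i \gamma$; then the Cauchy-Schwarz inequality 
implies that 

$$
|S_{s,l}| \le \Big\{ n^{-1} \sum_{i=1}^{n} (B_i^{(s,l)})^T V_i^{-1} B_i^{(s,l)} \Big\}^{1/2} \| \Delta\eta - \Pi_{n,X}(\Delta\eta) \|_n = O_p(h^p),
$$

thus $\|\hat\gamma_\mu - \gamma\| = O_p(J_n^{1/2} h^p)$. For any $c \in R^{J_n d_2}$ with $\|c\| = 1$, we write 
$c^T \hat\gamma_e = \sum_{i=1}^{n} a_i \varepsilon_i$, where the $\varepsilon_i$ are independent conditioning 
on $(X, Z)$ and 

$$
a_i^2 = c^T H^{22} \{ B_i - X_i H_{XX}^{-1} H_{XB} \}^T V_i^{-1} \Sigma_i V_i^{-1} \{ B_i - X_i H_{XX}^{-1} H_{XB} \} H^{22} c.
$$

Following the same arguments as those in Lemma A.6, we have 
$\max_{1 \le i \le n} |a_i| = O_p(J_n^{1/2} n^{-1})$. Thus 
$\|\hat\gamma_e\| \le J_n^{1/2} |c^T \hat\gamma_e| = J_n^{1/2} |\sum_{i=1}^{n} a_i \varepsilon_i| = O_p(J_n^{1/2} n^{-1/2})$. 
Therefore, $\|\hat\gamma_l - \gamma_l\| = O_p(J_n^{1/2} h^p + J_n^{1/2} n^{-1/2})$. Because 
$\tilde\eta_l(z_l) = B_l(z_l)^T \gamma_l$ and $\hat\eta_l(z_l) = B_l(z_l)^T \hat\gamma_l$, we have 

$$\|\hat\eta_l - \tilde\eta_l\|_{L_2}^2 = \|\hat\gamma_l - \gamma_l\|^2 \times O_p(1) = O_p(J_n h^{2p} + J_n n^{-1}).$$

Thus one has 

$$\|\hat\eta_l - \eta_l\|_{L_2}^2 \le 2 ( \|\hat\eta_l - \tilde\eta_l\|_{L_2}^2 + \|\tilde\eta_l - \eta_l\|_{L_2}^2 ) = O_p(J_n h^{2p} + J_n n^{-1}).$$
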
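A.3. Proof of Theorem 3 

Let $\tau_n = n^{-1/2} + a_n$. It suffices to show that for any given $\zeta > 0$, there exists a 
large constant $C$ such that 

$$
P \Big\{ \inf_{\|u\| = C} Q_P(\beta_0 + \tau_n u) > Q_P(\beta_0) \Big\} \ge 1 - \zeta.
\tag{A.7}
$$

Plugging $\hat\gamma(\beta)$ in (7) into $Q(\beta)$ defined in (8), we have 

$$Q(\beta) = \frac{1}{2} \sum_{i=1}^{n} (Y_i - X_i \beta - \Pi_n Y_i)^T V_i^{-1} (Y_i - X_i \beta - \Pi_n Y_i).$$

Let $U_{n,1} = Q(\beta_0 + \tau_n u) - Q(\beta_0)$ and 
$U_{n,2} = n \sum_{k=1}^{r} \{ p_{\lambda_k}(|\beta_{k0} + \tau_n u_k|) - p_{\lambda_k}(|\beta_{k0}|) \}$, 
where $r$ is the number of components of $\beta_{10}$. Note that $p_{\lambda_k}(0) = 0$ and 
$p_{\lambda_k}(|\beta|) \ge 0$ for all $\beta$. Thus, 
$Q_P(\beta_0 + \tau_n u) - Q_P(\beta_0) \ge U_{n,1} + U_{n,2}$. 

For $U_{n,1}$, we have 
$Q(\beta_0 + \tau_n u) = Q(\beta_0) + \tau_n u^T \dot Q(\beta_0) + \frac{1}{2} \tau_n^2 u^T \ddot Q(\beta^*) u$, 
where $\ddot Q(\beta) = \sum_{i=1}^{n} X_i^T V_i^{-1} X_i$ and 
$\beta^* = t(\beta_0 + n^{-1/2} u) + (1 - t) \beta_0$, $t \in [0, 1]$. Note that 

$$
\begin{aligned}
\dot Q(\beta_0) &= - \sum_{i=1}^{n} X_i^T V_i^{-1} (Y_i - X_i \beta_0 - \Pi_n Y_i) \\
&= - \sum_{i=1}^{n} X_i^T V_i^{-1} \Big\{ \sum_{l=1}^{d_2} \eta_l(Z_{il}) - \Pi_n \Big( \sum_{l=1}^{d_2} \eta_l(Z_{il}) \Big) \Big\} - \sum_{i=1}^{n} X_i^T V_i^{-1} (e_i - \Pi_n e_i),
\end{aligned}
$$

where $e_i = Y_i - \mu_i$. Mimicking the proof for Lemmas A.5 and A.6, we have 

$$
\sum_{i=1}^{n} X_i^T V_i^{-1} \Big\{ \sum_{l=1}^{d_2} \eta_l(Z_{il}) - \Pi_n \Big( \sum_{l=1}^{d_2} \eta_l(Z_{il}) \Big) \Big\} = o_p(n^{1/2}),
\qquad
\sum_{i=1}^{n} X_i^T V_i^{-1} (e_i - \Pi_n e_i) = O_p(n^{1/2}).
$$

Thus $\tau_n u^T \dot Q(\beta_0) = O_p(n^{1/2} \tau_n) \|u\|$. By the proof of Lemma A.4, we obtain that 
$\frac{1}{2} \tau_n^2 u^T \ddot Q(\beta_0) u = O_p(n \tau_n^2) + o_p(1)$. Thus 

$$U_{n,1} = O_p(n^{1/2} \tau_n) + O_p(n \tau_n^2) + o_p(1). \tag{A.8}$$

For $U_{n,2}$, by a Taylor expansion, 

$$p_{\lambda_k}(|\beta_{k0} + \tau_n u_k|) = p_{\lambda_k}(|\beta_{k0}|) + \tau_n u_k p'_{\lambda_k}(|\beta_{k0}|) \mathrm{sgn}(\beta_{k0}) + \frac{1}{2} \tau_n^2 u_k^2 p''_{\lambda_k}(|\beta_k^*|),$$

where $\beta_k^* = (1 - t) \beta_{k0} + t(\beta_{k0} + n^{-1/2} u_k)$, $t \in [0, 1]$, and 

$$p_{\lambda_k}(|\beta_{k0} + \tau_n u_k|) = p_{\lambda_k}(|\beta_{k0}|) + \tau_n u_k p'_{\lambda_k}(|\beta_{k0}|) \mathrm{sgn}(\beta_{k0}) + \frac{1}{2} \tau_n^2 u_k^2 p''_{\lambda_k}(|\beta_{k0}|) + o(n^{-1}).$$

Thus, by the Cauchy-Schwarz inequality, 

$$
\begin{aligned}
n^{-1} U_{n,2} &= \tau_n \sum_{k=1}^{r} u_k p'_{\lambda_k}(|\beta_{k0}|) \mathrm{sgn}(\beta_{k0}) + \frac{1}{2} \tau_n^2 \sum_{k=1}^{r} u_k^2 p''_{\lambda_k}(|\beta_{k0}|) \\
&\le \sqrt{r} \, \tau_n a_n \|u\| + \frac{1}{2} \tau_n^2 w_n \|u\|^2 \le C \tau_n^2 (\sqrt{r} + w_n C).
\end{aligned}
$$

As $w_n \to 0$, the first two terms on the right-hand side of (A.8) dominate $U_{n,2}$ by taking 
$C$ sufficiently large. Hence (A.7) holds for sufficiently large $C$. 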

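A.4. Proof of Theorem 4 

We first show that the estimator $\hat\beta$ must possess the sparsity property $\hat\beta_2 = 0$, 
which is stated as follows. 

Lemma A.8. Under the conditions of Theorem 4, with probability tending to 1, for any 
given $\beta_1$ satisfying $\|\beta_1 - \beta_{10}\| = O_p(n^{-1/2})$ and any constant $C$, 

$$Q_P\{ (\beta_1^T, 0^T)^T \} = \min_{\|\beta_2\| \le C n^{-1/2}} Q_P\{ (\beta_1^T, \beta_2^T)^T \}.$$

Proof. To prove that the minimizer is obtained at $\beta_2 = 0$, it suffices to show that, with 
probability tending to 1 as $n \to \infty$, for any $\beta_1$ satisfying 
$\|\beta_1 - \beta_{10}\| = O_p(n^{-1/2})$ and $\|\beta_2\| \le C n^{-1/2}$, 
$\partial Q_P(\beta) / \partial \beta_k$ and $\beta_k$ have the same sign for 
$\beta_k \in (-C n^{-1/2}, C n^{-1/2})$, for $k = r + 1, \ldots, d_1$. Note that 

$$\frac{\partial Q_P(\beta)}{\partial \beta_k} = \dot Q_k(\beta) + n p'_{\lambda_{kn}}(|\beta_k|) \mathrm{sgn}(\beta_k),$$

where 
$\dot Q_k(\beta) = \dot Q_k(\beta_0) + \sum_{k'=1}^{d_1} \ddot Q_{kk'}\{ t \beta_{k'} + (1 - t) \beta_{0k'} \} (\beta_{k'} - \beta_{0k'})$, 
$t \in [0, 1]$. It follows by arguments similar to those given in the proofs of Theorems 1 
and 3 that 

$$\dot Q_k(\beta_0) = -n \Big\{ n^{-1} \sum_{i=1}^{n} \xi_k(Y_i, X_i, Z_i) + o_p(n^{-1/2}) \Big\},$$

where $\xi_k(Y_i, X_i, Z_i)$ is the $k$th element of $X_i^T V_i^{-1} (e_i - \Pi_n e_i)$. According to 
Lemma A.7, we have 

$$n^{-1} \ddot Q(\beta_0) = E \Big( n^{-1} \sum_{i=1}^{n} X_i^T V_i^{-1} X_i \Big) + o_p(1) = R + o_p(1),$$

$$\frac{1}{n} \sum_{k'=1}^{d_1} \ddot Q_{kk'} (\beta_{k'} - \beta_{0k'}) = (\beta - \beta_0)^T \{ R_k + o_p(1) \},$$

where $R_k$ is the $k$th column of $R$. Note that $\|\beta - \beta_0\| = O_p(n^{-1/2})$ by the assumption. 
Thus, $n^{-1} \dot Q_k(\beta)$ is of the order $O_p(n^{-1/2})$. Therefore, for any nonzero $\beta_k$ and 
$k = r + 1, \ldots, d_1$, 

$$\frac{\partial Q_P(\beta)}{\partial \beta_k} = n \lambda_{kn} \Big\{ \lambda_{kn}^{-1} p'_{\lambda_{kn}}(|\beta_k|) \mathrm{sgn}(\beta_k) + O_p \Big( \frac{n^{-1/2}}{\lambda_{kn}} \Big) \Big\}.$$

Since $\liminf_{n \to \infty} \liminf_{\beta_k \to 0^+} \lambda_{kn}^{-1} p'_{\lambda_{kn}}(\beta_k) > 0$ and 
$\sqrt{n} \lambda_{kn} \to \infty$, the sign of the derivative is determined by that of $\beta_k$. 
Thus the desired result is obtained. □ 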

Proof of Theorem 4. From Lemma A.8, it follows that $\hat\beta_2 = 0$. 

gp(/3) = g(/3o) + g(/3*)(/3 -/3„) +nTK,„(|/3fco|)sign(/?fco)}Li 
Y.pLi\M) + opii)\iMi- M, 



where $\beta^* = t \beta_0 + (1 - t) \beta$, $t \in [0, 1]$. Using an argument similar to the proof of 
Theorem 3, it can be shown that there exists a $\hat\beta_1$ in Theorem 3 that is a root-$n$ 
consistent local minimizer of $Q_P\{ (\beta_1^T, 0^T)^T \}$, satisfying the penalized least squares 
equations $\dot Q_P\{ (\hat\beta_1^T, 0^T)^T \} = 0$. Mimicking the proofs for Lemmas A.5 and A.6 
indicates that the left-hand side of the above equation can be written as 

r). 

n ' ' 



'E^i«Vr'(e, - fine,) + {p'^^J|/3fco|)sign(/?fco)}Li +op(n-i/2) 

i=l 

+ 1^ ECv-^Xi,^ + op(l) I (3^ - /3io) 
+ lEPA.„(l/Sfco|) + op(l)l(3r -/3io)- 



Thus we have 

A 

( / 71 

•T 

k \ 4=1 / J 

Similar arguments to Lemmas A. 6 and A. 7 yield the asymptotic normality. □ 



E \n-^Y.^,,Yr^%,A + + op(l) - (3 
Supplement to "Simultaneous variable selection and estimation in semiparametric 
modeling of longitudinal/clustered data" (DOI: 10.3150/11-BEJ386SUPP; .pdf). We provide 
detailed proofs of Lemmas A.2 to A.7 stated in the Appendix. 